Abstract


Tree structures are ubiquitous in data across many domains, and many datasets
are naturally modelled by unobserved tree structures. In this paper, we first review
the theory of random fragmentation processes [Bertoin, 2006] and a number of
existing methods for modelling trees, including the popular nested Chinese restaurant
process (nCRP). We then define a general class of probability distributions over
trees, the Dirichlet fragmentation process (DFP), through a novel combination of
the theory of Dirichlet processes and random fragmentation processes. The DFP
admits a stick-breaking construction, and relates to the nCRP in the same way
the Dirichlet process relates to the Chinese restaurant process. Furthermore, we
develop a novel hierarchical mixture model with the DFP, and empirically compare
the new model to similar models in machine learning. Experiments show the DFP
mixture model to be convincingly better than existing state-of-the-art approaches
for hierarchical clustering and density modelling.

The process of random fragmentation is common to many areas, such as the degradation
of large polymer chains in chemistry, or the evolution of phylogenetic trees in biology.
An elegant mathematical tool for describing such phenomena is the fragmentation process
(FP) [Bertoin, 2006]. As a concrete example of an FP, consider a stick of unit length. At
every time point, the stick breaks into two smaller pieces. Then, each of the resulting
smaller sticks independently repeats the procedure, and the process continues ad infinitum.
This process can be described with the FP framework, and generalised to arbitrary
distributions over the splits of the stick, breaking times, and number of splits.
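The stick example can be simulated in a few lines. The sketch below is only illustrative: the uniform split location and the names `fragment_binary` and `simulate` are assumptions made for the example, not choices stated in the text.

```python
import random


def fragment_binary(sticks, rng):
    """Split every stick into two pieces at an independent random location."""
    pieces = []
    for length in sticks:
        u = rng.random()  # uniform split location: an assumed choice
        pieces.extend([length * u, length * (1.0 - u)])
    # Report the sub-sticks in decreasing order of length.
    return sorted(pieces, reverse=True)


def simulate(num_steps, seed=0):
    """Run the discrete-time binary fragmentation for `num_steps` steps."""
    rng = random.Random(seed)
    sticks = [1.0]  # a single stick of unit length
    history = [sticks]
    for _ in range(num_steps):
        sticks = fragment_binary(sticks, rng)
        history.append(sticks)
    return history


history = simulate(num_steps=3)
```

Each entry of `history` is the sorted list of stick lengths at one time step; the lengths always sum to one because every split conserves mass.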

The process of fragmentation can be interpreted as inducing a tree structure. In the
probability theory community, Aldous [1991] has worked on binary fragmentation trees
and used a symmetric beta distribution as the fragmentation operator for binary trees.
McCullagh et al. [2008] have worked on theoretical aspects of Bertoin [2006]'s relation
to tree priors, studying both binary and multifurcating trees. Teh et al. [2011] have
recently begun studying the relation between fragmentation and coagulation processes,
and relating these to practical applications in machine learning. Apart from the last work,
the literature has mostly concentrated on theoretical aspects of the FP, and pragmatic
aspects of the process have been largely overlooked.

Figure 1: Recursive Stick Breaking. The plot on the left shows an example of recursive
breaking: at the first level, the unit-size stick breaks into infinitely many sub-sticks. The
first 3 sticks are illustrated and the remaining sticks are represented by dots. Then at the
second level, a similar stick-breaking process is applied to each sub-stick. This recursive
stick-breaking process repeats until a pre-determined maximum depth is reached. The
plot in the middle shows the resulting tree structure obtained by discarding stick sizes. The plot
on the right shows the sequence indexing scheme.

The rest of this paper is organised as follows. Section 1 briefly reviews results
on fragmentation processes (FP) as introduced in Bertoin [2006], and the nested Chinese
restaurant process [Blei et al., 2010]. This lays the way for a general probabilistic
framework for modelling trees. In Section 2, we derive a useful variant of fragmentation
processes - the Dirichlet fragmentation process (DFP) - through a combination of the
theory of Dirichlet processes and fragmentation processes. A notable property of the DFP
is that it relates to the nCRP in the same way the Dirichlet process relates to the Chinese
restaurant process; that is, the DFP forms the underlying de Finetti measure of the nCRP.
Inspired by this property, in Section 3 we develop a hierarchical infinite mixture model
with the DFP prior as its mixing distribution, in the same spirit as using the Dirichlet
process prior as the mixing distribution for an infinite mixture model. Furthermore, in
Section 4 we describe an associated effective yet simple sampling procedure for the DFP
mixture model. Finally, in Section 5 we assess the model with a set of experiments for
density estimation and hierarchical clustering, demonstrating an improvement on existing
state-of-the-art approaches.


1 Preliminaries 

We begin by briefly reviewing the fragmentation process and nested Chinese restaurant 
process upon which our new model is based. The relation between the two will become 
clear in the next sections. 

Throughout this paper, we will use finite-length sequences of natural numbers as our
index set on the nodes in a tree, i.e. we let $\omega = (\omega_1, \omega_2, \ldots, \omega_L)$ denote a length-$L$
sequence of positive integers, $\omega_i \in \mathbb{N}$. We denote the zero-length string as $\omega = A$ and use
$|\omega|$ to indicate the length of sequence $\omega$. When viewing these strings as node indices in
a tree, $(\omega\omega_i : \omega_i \in \mathbb{N})$ are the children of $\omega$, $A \preceq \omega' \prec \omega$ are the ancestors of $\omega$, and
$A$ is the root of the tree.
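One convenient way to work with this indexing scheme in code is to represent each node index as a tuple of positive integers, with the empty tuple as the root. The helper names below (`ROOT`, `child`, `ancestors`, `path`) are hypothetical and serve only to illustrate the notation.

```python
ROOT = ()  # the zero-length sequence, i.e. the root of the tree


def child(node, i):
    """Index of the i-th child (i >= 1) of `node`."""
    if i < 1:
        raise ValueError("child indices are positive integers")
    return node + (i,)


def depth(node):
    """Length |omega| of the index sequence."""
    return len(node)


def ancestors(node):
    """Proper ancestors of `node`, from the root downwards (node excluded)."""
    return [node[:k] for k in range(len(node))]


def path(node):
    """The ancestors of `node` together with `node` itself."""
    return ancestors(node) + [node]
```

With this representation the prefix relation between sequences is exactly the ancestor relation between nodes.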
1.1 Fragmentation Processes

To give a more concrete description of fragmentation processes, first recall the stick-breaking
example above. We use $\pi(t)$ to denote the set of sub-sticks present
at each time $t \in \mathbb{R}^+$ (the set of non-negative real numbers), that is $\pi(t) = (\pi_n(t))_{n \in \mathbb{N}}$,
where the subscript $n$ indexes the resulting sub-sticks. Then the stick-breaking process
$\Pi = (\pi(t))_{t \in \mathbb{R}^+}$ is an example of a (mass) fragmentation process. Motivated by this informal
description, we define a fragmentation operator on sequences of real numbers in the general
setting, and then give a formal definition of a (mass) fragmentation process over some
space $S$. We do this by adapting the formulation in [Bertoin, 2006, p. 119].

First consider the space $S$ of non-increasing non-negative sequences that sum to
one, given by $S := \{\pi = (\pi_i)_{i \in \mathbb{N}} \mid \pi_1 \geq \pi_2 \geq \ldots \geq 0, \ \sum_i \pi_i = 1\}$. For each bounded sequence
$(\pi_i)_{i \in \mathbb{N}}$ of non-negative real numbers we denote by $(\pi_i)^{\downarrow}_{i \in \mathbb{N}}$ the re-ordering of $(\pi_i)_{i \in \mathbb{N}}$ in
a decreasing manner; we thus have that $(\pi_i)^{\downarrow}_{i \in \mathbb{N}} \in S$ if and only if $\sum_i \pi_i = 1$. We
now define a fragmentation operator on the space $S$, and then give the definition of a
fragmentation process (FP).

Definition 1.1 (Random Fragmentation Operator). Let Frag($\cdot$, $\cdot$) be a fragmentation
operator defined as follows:

$$\mathrm{Frag}\big(\pi, (\tilde{\pi}^{(i)})_{i \in \mathbb{N}}\big) := \big(\pi_i \, \tilde{\pi}^{(i)}_j : i, j \in \mathbb{N}\big)^{\downarrow} \qquad (1.1)$$

where $(\tilde{\pi}^{(i)})_{i \in \mathbb{N}}$ are i.i.d. copies of some random sequence $\tilde{\pi}$. That is, for every integer $i$,
Frag($\cdot$, $\cdot$) defines the distribution over the partitions of the $i$-th block $\pi_i$ of $\pi$ induced by
the $i$-th i.i.d. copy $\tilde{\pi}^{(i)}$. The resulting partitions are the scaled sequences $\pi_i \cdot (\tilde{\pi}^{(i)}_1, \tilde{\pi}^{(i)}_2, \ldots)$.
Collecting these partitions for each $\pi_i$ and rearranging them in decreasing order, we get
the right-hand side of Equation (1.1).
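On finite sequences the fragmentation operator reduces to a scale, collect and sort step. The following sketch works on truncated (finite) sequences, which is an assumption made for illustration; the function name `frag` and the example values are hypothetical.

```python
def frag(pi, copies):
    """Finite-truncation sketch of the fragmentation operator.

    pi     : list of block masses (pi_1, pi_2, ...)
    copies : list of sequences; copies[i] splits block i and sums to one
    """
    if len(pi) != len(copies):
        raise ValueError("need exactly one splitting sequence per block")
    # Scale the i-th splitting sequence by the mass of the i-th block.
    pieces = [mass * part for mass, split in zip(pi, copies) for part in split]
    # Rearrange all resulting pieces in decreasing order.
    return sorted(pieces, reverse=True)


pi = [0.6, 0.4]
copies = [[0.5, 0.5], [0.75, 0.25]]
result = frag(pi, copies)
```

Because each splitting sequence sums to one, the output again sums to one, so a sequence in the space of sorted mass partitions is mapped back into that space.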

Definition 1.2 (Random Fragmentation Process, FP). We call an $S$-valued Markov
process $\pi(t) := (\pi_n(t))_{n \in \mathbb{N}}$ a (mass) fragmentation process if the following two conditions
hold:

i. $\pi(0) = (1, 0, 0, \ldots)$.

ii. For any $t, u \in \mathbb{R}^+$, conditioned on $\pi(t)$, the random variable $\pi(t + u)$ has the following
distribution:

$$\pi(t + u) \overset{d}{=} \mathrm{Frag}\big(\pi(t), (\pi^{(i)}(u))_{i \in \mathbb{N}}\big) \qquad (1.2)$$

where $\overset{d}{=}$ means equality in distribution.

In the fragmentation process, each sequence $\pi(t)$ corresponds to a specific sorted split
of a stick, as in the stick-breaking example above. Intuitively, a fragmentation
process can be understood through the stick-breaking example; in each splitting event the
stick $\pi_i$ is replaced with a (possibly infinite) sequence of shorter sticks that sum to $\pi_i$. The
splitting event is independent of the splitting time, which in a more general setting would
be given by a deterministic function. We will assume all sticks split concurrently according
to such a function. The selection of the deterministic function used for the splitting rate,
or the divergence function, will be explained further in the following section. Note that for
practical purposes, in this work we focus on the discrete-time FP, that is, splitting events
are only allowed at discrete time steps (which corresponds to a fragmentation chain, cf.
Bertoin [2006]).
1.2 Nested Chinese Restaurant Processes

The nested Chinese restaurant process [nCRP; Blei et al., 2010] is a Dirichlet “path-reinforcing”
traversal of a tree where each data point starts at the root and descends to
the leaves. More specifically, the first data point descends from the root and creates a
new node with probability 1; the same data point repeats this process up to a pre-defined
depth, resulting in a leaf node (obtaining a chain graph). A later data point $i$ starts from
the root, and descends according to a Chinese restaurant process (until maximum depth
$L$). That is, if the data point reaches $\omega$, it will either descend to an existing child or create
a new child with probabilities:

$$p(\omega\omega_i \mid \omega) = \begin{cases} n_{\omega\omega_i} / (n_{\omega\cdot} + \alpha(|\omega|)) & \text{descends to child } \omega\omega_i \\ \alpha(|\omega|) / (n_{\omega\cdot} + \alpha(|\omega|)) & \text{creates a new child} \end{cases} \qquad (1.3)$$


Here $n_{\omega\omega_i}$ denotes the number of data points descending from node $\omega$ to node $\omega\omega_i$, among all
data points preceding data point $i$, and $n_{\omega\cdot}$ denotes the marginal count $\sum_i n_{\omega\omega_i}$. This formulation
leads to the well-known “rich get richer” self-reinforcing property, which has proved
useful in various applications such as topic modelling and genetic mutation clustering
[Teh, 2010].
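The descent rule can be sketched as a small sampler. This is a minimal illustration under stated assumptions: node indices are tuples of child positions, `alpha` is passed as a function of depth, and the name `sample_ncrp_paths` is hypothetical.

```python
import random
from collections import defaultdict


def sample_ncrp_paths(num_points, max_depth, alpha, seed=0):
    """Draw root-to-leaf paths from a depth-limited nested CRP.

    alpha : function mapping the depth of a node to its concentration
    Returns the sampled paths and the number of points that entered each node.
    """
    rng = random.Random(seed)
    counts = defaultdict(int)        # counts[node]: points that descended into node
    num_children = defaultdict(int)  # num_children[node]: children created so far
    paths = []
    for _ in range(num_points):
        node = ()
        while len(node) < max_depth:
            k = num_children[node]
            child_counts = [counts[node + (j,)] for j in range(1, k + 1)]
            total = sum(child_counts) + alpha(len(node))
            u = rng.random() * total
            chosen = k + 1  # default: create a new child
            acc = 0.0
            for j, n in enumerate(child_counts, start=1):
                acc += n
                if u < acc:  # existing child j is chosen with probability n / total
                    chosen = j
                    break
            if chosen == k + 1:
                num_children[node] = k + 1
            node = node + (chosen,)
            counts[node] += 1
        paths.append(node)
    return paths, dict(counts)


paths, counts = sample_ncrp_paths(50, 3, lambda depth: 1.0, seed=1)
```

The first point always creates a chain of new nodes, and later points tend to follow well-trodden branches, which is the self-reinforcing behaviour described above.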

Probability of the Combinatorial Structure. For each node $\omega$ we refer to the set
of ancestor nodes $\{\omega' : A \preceq \omega' \preceq \omega\}$ - including the root and $\omega$ itself - as a path. The
probability of each path is simply the product of the probabilities given in Equation (1.3):

$$p(A \to \omega) = \prod_{\omega'\omega_i \preceq \omega} p(\omega'\omega_i \mid \omega') \qquad (1.4)$$


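For each node, we refer to the collection of child nodes $\omega\omega_i$, and the counts associated
with each child node, as its branching structure. Since the branching structure is created
by a CRP, we can write down the probability of the combinatorial branching structure
analytically:

$$p(\text{branching structure at } \omega) = \frac{\alpha^{K_\omega} \, \Gamma(\alpha) \prod_{k=1}^{K_\omega} \Gamma(n_{\omega k})}{\Gamma(n_{\omega\cdot} + \alpha)} \qquad (1.5)$$

where $K_\omega$ is the number of children nodes of $\omega$, and $\alpha$ is the concentration parameter.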
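For numerical work the branching-structure probability is best evaluated in log space. The sketch below uses the standard CRP partition probability, $\alpha^{K} \Gamma(\alpha) \prod_k \Gamma(n_k) / \Gamma(n + \alpha)$ for $K$ children with counts $n_k$ summing to $n$; the function name `log_branching_prob` is hypothetical.

```python
import math


def log_branching_prob(child_counts, alpha):
    """Log-probability of one CRP branching structure.

    child_counts : positive counts of data points under each child
    alpha        : concentration parameter at this node
    """
    k = len(child_counts)
    n = sum(child_counts)
    return (
        k * math.log(alpha)
        + math.lgamma(alpha)
        + sum(math.lgamma(c) for c in child_counts)
        - math.lgamma(n + alpha)
    )


log_p = log_branching_prob([2, 1], alpha=1.0)
```

As a sanity check, with counts (2, 1) and concentration 1 the sequential CRP probabilities are 1, 1/2 and 1/3, whose product 1/6 agrees with the closed form.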


2 Dirichlet Fragmentation Processes 

There exist many distributions satisfying the second condition set in Definition 1.2, each
leading to a distinct family of fragmentation processes with different properties. One
notable example of such distributions is the Poisson-Dirichlet (PD) distribution and its
2-parameter extension¹ [Pitman and Yor, 1997]. The PD distribution and its extensions
have been shown to be powerful Bayesian nonparametric tools for mixture models (e.g.
the popular Dirichlet process (DP) mixtures). Motivated by this success of the PD
distribution, in this paper we derive a Dirichlet fragmentation process (DFP) defined as
follows.


Definition 2.1 (DFP). We call a fragmentation process a Dirichlet fragmentation
process if at each time $t$ the Frag operator induces a Poisson-Dirichlet distribution over the
partitions.

¹The 2-parameter PD distribution is also known as the Pitman-Yor process (PYP).
A useful property of the random fragmentation process is that it satisfies the Markov
property - given a stick $\omega$, subsequent fragmentation events are independent of $\omega$'s
ancestors in the tree.

2.1 Recursive Stick-Breaking Construction 

We gave an informal description of the stick-breaking process in Section 1.1; now we give
a formal definition of the process and use it as a constructive procedure for sampling from
the Dirichlet fragmentation process. The stick-breaking process defined by Sethuraman
[1994] is a constructive way of drawing samples from the DP. A random probability
measure $G$ can be drawn from a DP given a base probability measure $H$ and concentration
parameter $\alpha$ using a sequence of beta draws:

$$G = \sum_{k=1}^{\infty} \pi_k \delta_{\phi_k}, \qquad \pi_k = \nu_k \prod_{i=1}^{k-1} (1 - \nu_i), \qquad (2.1)$$

$$\nu_k \sim \mathrm{Beta}(1, \alpha), \qquad \phi_k \sim H.$$

This can be viewed as taking a stick of unit length and breaking it at a random location.
We call the left side of the stick $\pi_1$ and then break the right side at a new place, calling the left
side of this new break $\pi_2$. We then continue this process of “keep the left piece and break
the right piece again”. Sethuraman [1994] showed that the sequence of weights obtained
from the stick-breaking process $(\pi_1, \pi_2, \ldots)$ is distributed according to the Poisson-Dirichlet
distribution [Pitman and Yor, 1997]. Thus the stick-breaking procedure can be used as a
Dirichlet Frag operator. This is a useful property since we can apply this stick-breaking
Frag operator in a recursive way to induce a tree structure. This property has been noted
and studied by Adams et al. [2010]. Here we provide a modified tree-structured
stick-breaking procedure and use it as a way of sampling from the DFP.
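The “keep the left piece and break the right piece again” procedure translates directly into code. The sketch below truncates the infinite sequence after a fixed number of sticks, which is an assumption made so that the example terminates; the name `stick_breaking_weights` is hypothetical.

```python
import random


def stick_breaking_weights(alpha, num_sticks, seed=0):
    """Truncated stick-breaking weights with Beta(1, alpha) break points.

    Returns the first `num_sticks` weights and the mass left unbroken.
    """
    rng = random.Random(seed)
    weights = []
    remaining = 1.0  # length of the right-hand piece still to be broken
    for _ in range(num_sticks):
        nu = rng.betavariate(1.0, alpha)
        weights.append(remaining * nu)  # keep the left piece
        remaining *= 1.0 - nu           # break the right piece again
    return weights, remaining


weights, leftover = stick_breaking_weights(alpha=2.0, num_sticks=20)
```

The weights are not sorted; arranging them in decreasing order gives the ranked sequence used by the fragmentation operator.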

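Now we describe the recursive stick-breaking process. The first step is to sample a
beta random variable $\nu_\omega \sim \mathrm{Beta}(1, \alpha)$ for each node in the tree with the exception of
the root node. Then the length of the stick associated with node $\omega\omega_i$ is given by

$$\pi_{\omega\omega_i} = \pi_\omega \, \nu_{\omega\omega_i} \prod_{k=1}^{\omega_i - 1} (1 - \nu_{\omega k}), \qquad (2.2)$$

where $\pi_\omega$ is the stick length of the parent node. By multiplying over the beta variables
of all prefixes of $\omega$, the recursive definition given in Equation (2.2) can be unpacked as

$$\pi_\omega = \prod_{\omega'\omega_i \preceq \omega} \nu_{\omega'\omega_i} \prod_{k=1}^{\omega_i - 1} (1 - \nu_{\omega' k}). \qquad (2.3)$$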
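The recursion can be sketched by applying the same stick-breaking step to every node, level by level. To keep the example finite, only the first few children of each node are generated; this truncation, the tuple node indices and the name `recursive_stick_breaking` are assumptions of the sketch.

```python
import random


def recursive_stick_breaking(alpha, max_depth, branching, seed=0):
    """Stick lengths for all nodes down to `max_depth`.

    alpha     : function mapping the depth of a parent to its concentration
    branching : number of children kept per node (a truncation)
    Returns a dict from node index (tuple) to stick length.
    """
    rng = random.Random(seed)
    lengths = {(): 1.0}  # the root carries the whole unit stick
    frontier = [()]
    for depth in range(max_depth):
        next_frontier = []
        for node in frontier:
            remaining = lengths[node]  # parent mass not yet given to a child
            for i in range(1, branching + 1):
                nu = rng.betavariate(1.0, alpha(depth))
                lengths[node + (i,)] = remaining * nu
                remaining *= 1.0 - nu
                next_frontier.append(node + (i,))
        frontier = next_frontier
    return lengths


lengths = recursive_stick_breaking(lambda depth: 2.0, max_depth=2, branching=3)
```

Within each node the children's lengths sum to at most the parent's length, with the shortfall being the mass of the children that the truncation leaves out.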

More generally, the concentration parameter $\alpha$ is allowed to vary for different nodes. For
example, $\alpha$ can be a function of the depth of a given node, denoted by $\alpha(|\omega|)$. When the
concentration parameter is infinitesimal for each node (e.g., $\alpha(|\omega|) = a(t_\omega)dt$, where $t_\omega$
can be seen as a fictitious time associated with node $\omega$), and the maximum depth of the tree is
sufficiently large, the recursive stick breaking will generate binary trees with probability
1. This special case of the DFP is known as the Dirichlet diffusion tree [DDT; Neal, 2003].
Following a convention first introduced by Neal [2003], we shall call this function $\alpha(|\omega|)$
the divergence function². The recursive stick-breaking process and the tree node indexing
scheme are illustrated in Figure 1.

Figure 2: DFP Gaussian Diffusion Examples. Generation of a two-dimensional dataset
from the Gaussian diffusion with the number of discrete time steps $L = 40$, a = 1 and
$\alpha(l)$ given by Equation (2.4) in the footnote. The plot on the left shows the first 20 data
points generated, along with the underlying tree structure. The right plot shows 1000
data points obtained by continuing the procedure beyond these 20 data points.
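The example divergence function of Equation (2.4) in the footnote can be evaluated in closed form. The sketch below assumes that $a(s)$ is the antiderivative of $c/(1-s)$ with $a(0) = 0$, i.e. $a(s) = -c \log(1 - s)$, so that $\alpha(l) = c \log\big((L - l)/(L - l - 1)\big)$; this reading and the function name `divergence` are assumptions.

```python
import math


def divergence(l, num_steps, c):
    """alpha(l) = a((l + 1) / L) - a(l / L), assuming a(s) = -c * log(1 - s)."""
    if not 0 <= l < num_steps:
        raise ValueError("l must satisfy 0 <= l < L")
    if l == num_steps - 1:
        return math.inf  # a(1) diverges, so the last step has infinite concentration
    return c * math.log((num_steps - l) / (num_steps - l - 1))


alphas = [divergence(l, num_steps=10, c=2.0) for l in range(10)]
```

Under this assumption the concentration grows with depth, and the finite terms telescope to $c \log L$.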

2.2 Parent-Child Transition Operators 

Recall that for the Dirichlet process mixture model an unbounded number of partitions
is generated, where each partition is labelled with some parameter $\phi_k \sim H$. Given the
generated data partition and corresponding labels, each data point is assumed to arise as
a draw from a distribution $F(y \mid \phi_k)$, where $\phi_k$ is the $k$'th component label from which $y$
is generated. In the DFP we continue to assume that the data are generated independently
given the latent labelling, but take advantage of the tree-structured partitioning
of the data. That is, the distribution over the parameter at node $\omega\omega_i$, denoted $\phi_{\omega\omega_i}$,
should depend on the parameter $\phi_\omega$ of its parent $\omega$. This parent-child dependence is captured through
a transition operator, denoted $T(\phi_{\omega\omega_i} \leftarrow \phi_\omega) := p(\phi_{\omega\omega_i} \mid \phi_\omega)$. For example, the Gaussian
transition operator is given by

$$T(\phi_{\omega\omega_i} \leftarrow \phi_\omega) = \mathcal{N}(\phi_\omega, \sigma^2), \qquad p(A) = \mathcal{N}(0, \sigma^2) \qquad (2.5)$$

where $p(A)$ denotes the parameter distribution of the root node. An example of 1000
data points sampled from the DFP with a Gaussian transition operator is given in Figure 2.
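The Gaussian transition operator amounts to a random walk down each path of the tree. The sketch below samples the parameters along a single root-to-leaf path for one-dimensional data; the function name `sample_path_parameters` is hypothetical.

```python
import random


def sample_path_parameters(path_length, sigma, seed=0):
    """Parameters along one path: root ~ N(0, sigma^2), child ~ N(parent, sigma^2)."""
    rng = random.Random(seed)
    phi = rng.gauss(0.0, sigma)  # root parameter
    params = [phi]
    for _ in range(path_length):
        phi = rng.gauss(phi, sigma)  # each child is centred on its parent
        params.append(phi)
    return params


params = sample_path_parameters(path_length=4, sigma=1.0, seed=3)
```

Nodes that share a long common prefix share most of these increments, which is why leaves in the same subtree end up with similar parameters.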


3 A DFP Mixture Model 

Given a DFP prior over the tree structure, we can obtain a hierarchical infinite mixture
model by coupling the model with a mixture model component likelihood function, for
example a Gaussian data distribution, i.e.,

$$F(y \mid \phi_\omega) = \mathcal{N}(\phi_\omega, \sigma^2). \qquad (3.1)$$

Here the subscript $\omega$ denotes the index of the leaf node associated with data point $y$. We
assume the dimensionality of the data to be 1 to keep notation simple. We will use this
notation in the remaining part of the paper since the extension to arbitrary dimensionality
is straightforward.

²An example of such a divergence function is:

$$\alpha(l) = a((l + 1)/L) - a(l/L), \qquad (2.4)$$

where $L$ is the number of discrete time steps, and $a$ is defined as $a(s) = \int c/(1 - s)\,ds$, for some
hyperparameter $c$.

Figure 3: Results on the Aggregation (top row) and R15 (bottom row) datasets. The left
plot shows the original data; the middle plot gives trees sampled from the posterior
conditioning on the training data; the right plot shows the predictive densities resulting
from our DFP mixture model.

4 Inference by Gibbs Sampling 

Recall our variables of interest: the variables $y_i$ are our observations, and we let $z_i$ denote
the node (i.e. mixture component) from which $y_i$ was generated - each $y_i$ is assumed to
arise as a draw from $F(y_i \mid \phi_{z_i})$. Here the vector $\phi = (\phi_\omega)$ stores the parameters of each node.
We use $n_{\omega\cdot}$ to denote the number of leaves descended from node $\omega$, and $K_\omega$ to denote the
number of children of node $\omega$. Furthermore, we use $\omega k$ to denote the $k$'th child of $\omega$.

Let $y = y_{1:N}$ be the sequence of data items, $y_\omega = (y_i : z_i = \omega)$ be the sequence of
data items generated from node $\omega$, and $z = z_{1:N}$ be the sequence of nodes generating $y$.
We attach a superscript to a set of variables or a count (e.g. $y^{-i}$ or $n^{-i}$) to denote the
removal of the variable corresponding to the superscripted index from the variable set or
from the calculation of the count. In our examples $y^{-i} := y \setminus y_i$ and $n^{-i}_{\omega\cdot}$ is the number of
observations (i.e. leaves) descended from node $\omega$, leaving out data point $y_i$.

In the case of the Gaussian observation model, which is conjugate to the distribution
of the leaf parameters, we integrate out the leaf and internal parameters $\phi$ in the sampling
schemes. Denote the conditional density of $y_i$ under leaf node $z$ given all data points except
$y_i$ as $p(y_i \mid z, y^{-i})$. The non-conjugate case can be tackled by adapting techniques similar to
the ones developed for non-conjugate DP mixtures [Neal, 2000].

Finally we specify priors on the hyperparameter $c$ of the divergence function (Equation
(2.4) in the footnote) and on the diffusion precision $\tau$ (the inverse of $\sigma^2$ in Equation (2.5)):

$$c \sim \mathrm{Gamma}(a_c, b_c), \qquad \tau \sim \mathrm{Gamma}(a_\tau, b_\tau) \qquad (4.1)$$

Here Gamma($a$, $b$) is a Gamma distribution with shape $a$ and rate $b$. In all experiments
we used $a_c = 1$, $b_c = 1$, $a_\tau = 1$, $b_\tau = 1$. Next we describe a Gibbs sampler for the DFP.
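Step 1: Sampling z. This can be realised by

$$p(z_i = \omega \mid z^{-i}, y, c, \tau) \propto \begin{cases} p^{-i}(A \to \omega) \, p(y_i \mid \omega, y^{-i}) & \text{if } \omega \text{ is an existing leaf node} \\ p^{-i}(A \to \omega') \, \dfrac{\alpha_{\omega'}}{n^{-i}_{\omega'\cdot} + \alpha_{\omega'}} \, p(y_i \mid \omega, y^{-i}) & \text{if } \omega \text{ is a new leaf node} \end{cases} \qquad (4.2)$$

with $\omega'$ the parent of $\omega$.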

Here $p^{-i}(A \to \omega)$ is the probability of reaching node $\omega$ from the root node leaving out $y_i$ (as
defined in Equation (1.4)), and $\alpha_\omega$ is the divergence function evaluated at depth $|\omega|$.
Intuitively, the above equation defines the two ways that $y_i$ can be generated. In the
first way, data item $i$ follows an existing branch until it reaches a leaf node $\omega$, which has
probability $p^{-i}(A \to \omega)$. This probability is then multiplied by the likelihood term, giving us
the total probability that $y_i$ is generated from node $\omega$. In the second way, data item $i$
initially follows an existing branch until it reaches an (internal) node $\omega'$; it then diverges
from the current branch and creates a new leaf node, for which the total probability is
simply the product of the probability of reaching node $\omega'$ and the probability of diverging
from $\omega'$. Lastly, multiplied by a likelihood term, this gives us the probability of $y_i$ being
generated from a new node. Note that updating the leaf assignment of each data point
$y_i$ will also update the count vector $n$ and vice versa. In fact, this is the only way that
$z$ influences the other variables, i.e. $\phi$ and $c$.
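The prior part of these weights (everything except the likelihood term) can be computed from the paths of the remaining data items alone. The sketch below is an illustration under stated assumptions: leaves are tuples of equal length, `alpha` is a function of depth, and the name `prior_assignment_weights` is hypothetical. In the full sampler each weight would additionally be multiplied by the corresponding likelihood term.

```python
from collections import defaultdict


def prior_assignment_weights(paths, alpha):
    """Prior weights for assigning a held-out item, given the other items' leaves.

    Returns (existing, new): the path probability of every existing leaf, and,
    for every internal node, the probability of reaching it and then diverging.
    """
    counts = defaultdict(int)
    for leaf in paths:
        for k in range(len(leaf) + 1):
            counts[leaf[:k]] += 1  # items passing through each prefix

    max_depth = len(paths[0])
    reach = {(): 1.0}
    existing, new = {}, {}
    for node in sorted(counts, key=len):  # parents are visited before children
        if node:
            parent = node[:-1]
            a_parent = alpha(len(parent))
            reach[node] = reach[parent] * counts[node] / (counts[parent] + a_parent)
        if len(node) == max_depth:
            existing[node] = reach[node]
        else:
            a = alpha(len(node))
            new[node] = reach[node] * a / (counts[node] + a)
    return existing, new


other_paths = [(1, 1), (1, 1), (1, 2), (2, 1)]
existing, new = prior_assignment_weights(other_paths, lambda depth: 1.0)
```

Before the likelihood is applied, the weights of all existing leaves and all divergence points sum to one, since every descent either ends at an existing leaf or diverges exactly once.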

Step 2: Sampling the divergence function hyperparameter c. The probability of
the tree structure given the divergence function is simply the product of the probabilities of
the branching structures for each internal node. Since at each internal node $\omega$ the process
of descending to the children follows a CRP, the probability of a branching structure for
each internal node is given by Equation (1.5). Coupled with the gamma prior, the
Gibbs conditional probability for $c$ is

$$p(c \mid \tau, y, n, \phi) \propto \mathrm{Gamma}(a_c, b_c) \prod_{\omega:\ \text{all internal nodes}} p(\text{branching structure at } \omega), \qquad (4.3)$$

where each factor is evaluated using Equation (1.5) with concentration $\alpha(|\omega|)$.

Step 3: Sampling the precision τ. It is straightforward to sample $\tau$ given the node
parameters $\phi$. The probability of all node parameters $p(\phi)$ is simply the product of a set
of Gaussians, since each node's parameter distribution $p(\phi_{\omega\omega_i} \mid \phi_\omega)$ is Gaussian. Coupled
with a gamma prior, the Gibbs conditional probability for the precision $\tau$ is given by

$$p(\tau \mid c, y, n, \phi) \propto \mathrm{Gamma}(a_\tau, b_\tau) \times \prod_{\omega:\ \text{all internal nodes}} \ \prod_{\omega\omega_i:\ \text{children of } \omega} \mathrm{Gamma}\Big(1.5, \ \tfrac{(\phi_{\omega\omega_i} - \phi_\omega)^2}{2}\Big). \qquad (4.4)$$
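Because a Gaussian density with mean $\phi_\omega$ and precision $\tau$ is proportional to $\tau^{1/2} \exp(-\tau d^2 / 2)$ as a function of $\tau$, with $d$ the parent-child difference, the product with a Gamma($a_\tau$, $b_\tau$) prior is again a Gamma density, with shape $a_\tau + M/2$ and rate $b_\tau + \tfrac{1}{2}\sum d^2$ for $M$ parent-child pairs. The sketch below draws from that conditional; it only uses parent-child pairs, and the dict layout and the name `sample_tau` are assumptions.

```python
import random


def sample_tau(phi, a_tau=1.0, b_tau=1.0, seed=0):
    """Draw the diffusion precision from its Gamma conditional.

    phi : dict mapping node index (tuple) to a scalar parameter; every
          non-root node must have its parent present in the dict.
    Returns the draw together with the posterior shape and rate.
    """
    rng = random.Random(seed)
    diffs = [phi[node] - phi[node[:-1]] for node in phi if node]
    shape = a_tau + 0.5 * len(diffs)
    rate = b_tau + 0.5 * sum(d * d for d in diffs)
    # random.gammavariate is parameterised by shape and scale, so pass 1 / rate.
    return rng.gammavariate(shape, 1.0 / rate), shape, rate


phi = {(): 0.0, (1,): 1.0, (2,): -1.0, (1, 1): 1.5}
tau, shape, rate = sample_tau(phi)
```

Large parent-child differences increase the rate and therefore pull the sampled precision towards smaller values.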

In summary, for each observation the proposed Gibbs sampler iteratively samples a
path leading to it conditioned on the paths leading to the remaining observations (note this is
different from the Gibbs sampler for the nCRP topic model, which samples the path leading
to each observation in two separate steps; see Blei et al. [2010] for details). Most existing
inference procedures for trees employ a “prune-graft” algorithm; that is, first remove
a subtree and then propose to re-attach the subtree elsewhere. The proposal is then
accepted or rejected using an MH step. As we will show in the following section, empirically
this Gibbs sampler results in significantly improved performance when compared to
state-of-the-art models using this “prune-graft” inference for both hierarchical clustering
and density estimation.

Figure 4: Hierarchical clustering results on the Synthetic dataset. Left: tree structures
sampled from the DDT model conditioning on the data. Right: tree structures sampled
from the DFP mixture model conditioning on the data.

DATASET   GMM                DPM                DDT                DFP mixture
R15       -2.127 ± 0.158     -0.759 ± 0.122     -0.861 ± 0.123     -0.705 ± 0.086
D31       -2.593 ± 0.036     -1.790 ± 0.040     -1.798 ± 0.044     -1.654 ± 0.022
AGGR.     -2.151 ± 0.076     -2.064 ± 0.063     -2.091 ± 0.057     -1.431 ± 0.008
MACA.     -15.039 ± 0.584    -15.145 ± 0.611    -14.816 ± 0.546    -12.725 ± 0.127
CCLE      -4.8183 ± 0.947    -4.6036 ± 0.331    -3.7266 ± 0.457    -2.825 ± 0.495

Table 1: Predictive log likelihood ($\log_e$) for GMM, DP mixture, DDT, and DFP mixture.

5 Experiments 

In this section we describe two sets of experiments to highlight the two aspects of the
discrete-time DFP mixture model: its hierarchical nature and its nonparametric density
modelling nature. To demonstrate the hierarchical nature of the DFP we compare
the model to the agglomerative clustering algorithm. For the DFP, we implemented the
inference algorithm described in Section 4. The software implements the discrete DFP
with arbitrary depth, and is available at [URL]. We use Neal's Flexible Bayesian Modelling
(FBM) package for the DDT and Matlab's implementation for the agglomerative
clustering algorithm.

5.1 Hierarchical Clustering 

First we compare the DFP mixture model to the agglomerative clustering algorithm. We
performed experiments on four datasets (one hand-crafted synthetic dataset and three
real datasets). The real datasets we used are R15 (600 examples, 2 attributes; Veenman
et al. [2002]), Aggregation (referred to as AGGR; 788 examples, 2 attributes; Veenman
et al. [2002]), and Glass (214 examples, 7 classes, 9 attributes). For the synthetic dataset,
trees sampled from the posterior of the DFP mixture model and the DDT conditioning on
the training data are shown in Figure 4. Both methods find a good hierarchical clustering
of the data items. While the DDT is forced to choose a binary branching structure over
the clusters, the DFP can represent a more parsimonious solution. Such parsimonious
solutions are more interpretable and potentially lead to better explanations for the data.
Similar results are also observed for the real datasets. The results on the AGGR and R15
datasets are shown in Figure 3. As we can see from Figure 3, most data points with the
same class label are merged in the first level of the DFP mixture model, which leads to a
clean summary of the structure of the data.

Furthermore, in order to assess the quality of these hierarchical clustering results, we
also computed the tree purity score for various algorithms on the Glass dataset; the tree
purity score was introduced by Heller and Ghahramani [2005] and motivated as a reasonable
metric for evaluating hierarchical clustering algorithms. On the Glass dataset
the purity scores are 0.5064 (DFP), 0.4815 (agglomerative, average linkage), and 0.4568
(DDT). The results of the agglomerative algorithm are consistent with those reported in
Heller and Ghahramani [2005]. However, while Heller and Ghahramani [2005]'s Bayesian
Hierarchical Clustering algorithm exhibits a lower purity score than the agglomerative
algorithm on the Glass dataset, the DFP mixture model produces a slightly
better one.


5.2 Density Estimation 

To evaluate the power of the DFP in density estimation, we compare the DFP mixture
model to traditional mixture models including the Gaussian Mixture Model (GMM),
the Dirichlet Process Mixture (DPM), and the Dirichlet Diffusion Tree (DDT) over 5
datasets. The 5 datasets we used are the macaque skull measurements (MACA, 228
examples, 10 attributes), R15, Aggregation, D31 (3100 examples, 2 attributes), and the
Cancer Cell Line Encyclopedia (CCLE, 504 examples, 24 attributes). In particular, the
CCLE dataset consists of measurements of the sensitivity of 504 cancer-derived cell lines
to 24 drugs. Such data has the potential to help biologists understand the relationship
between different cancer types and drug effects, and to aid in clinical practice [Barretina
et al., 2012].

For all datasets we train each model using 90% of the data and report the predictive
log likelihood for the remaining 10% of the data. For the DFP, we set the depth of
the tree at $L = 4$. For all methods under comparison, we run the MCMC inference
algorithm until the predictive log likelihood for the training data converges. As shown
in Table 1, on all datasets the DFP mixture model obtains the highest predictive log
likelihood. For the MACA dataset, the DFP mixture model outperforms all previous
models: the performance of the model is 2.5 orders better (on the $\log_e$ scale). This is a
significant improvement, as previous attempts on the same dataset only obtained a small
improvement, as reported in Knowles and Ghahramani [2011] and Adams et al. [2008].
The improvement of the DFP over existing methods is consistent across all other datasets
we tried; in particular, the performance on CCLE is about 1 order better.

6 Discussion 

This paper has presented the Dirichlet fragmentation process for modelling tree structures.
The DFP is derived as a useful variant of fragmentation processes, and is connected
to a number of existing models such as the DDT [Neal, 2003], the nCRP [Blei et al., 2010],
TSSB [Adams et al., 2010], the PYDT [Knowles and Ghahramani, 2011], and the nDP
[Rodriguez et al., 2008]. In particular, we derived a simple hierarchical mixture model
based on the DFP, and an efficient Gibbs-style sampler. This DFP hierarchical mixture
model generalises the popular Dirichlet process mixture model. Unlike the latter, which
partitions data into a flat layer of clusters, the DFP mixture model organises clusters into
a tree structure. Not only does this provide a more interpretable summary of the data, but
it also leads to significantly better accuracy, as demonstrated in the density estimation
experiments. Future theoretical work will study the connection between the DFP and
hierarchical DPs, and extend the DFP to model group data and sequential data.